Abstract: The complete genome of severe acute respiratory
syndrome coronavirus (SARS-CoV) reveals the existence
of putative proteins unique to SARS-CoV. Identifying
their function facilitates a mechanistic understanding
of SARS infection and drug development for its treatment.
The sequences of most of these putative proteins
have no significant similarity to those of known proteins,
which complicates the use of sequence analysis
tools to probe their function. Support vector machines
(SVM), which are useful for predicting the functional class of
distantly related proteins, are employed to ascribe a possible
functional class to SARS-CoV proteins. Testing results
indicate that SVM is able to predict the functional class
of 73% of the known SARS-CoV proteins with available
sequences and 67% of 18 other novel viral proteins.
Combining the sequence comparison method BLAST
with SVMProt further improves the prediction accuracy
of SVMProt such that the functional class of two additional
SARS-CoV proteins is correctly predicted. Our study
suggests that the SARS-CoV genome possibly contains
a putative voltage-gated ion channel, structural proteins,
a carbon-oxygen lyase, oxidoreductases acting on the
CH-OH group of donors, and an ATP-binding cassette
transporter. A web version of our software, SVMProt, is
accessible at http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi.
Introduction


Following the identification of a novel coronavirus as the
cause of severe acute respiratory syndrome (SARS),!~* the complete
genome of this virus has been determined?* and its sequence
variations in different isolates have been analyzed.®° The SARS
coronavirus (SARS-CoV) genome contains five major open
reading frames (ORFs), which encode the replicase polyprotein,
spike glycoprotein (S), small envelope protein (E), membrane
glycoprotein (M), and nucleocapsid protein (N) found in other
coronaviruses.*-° Moreover, nine potential ORFs unique to
SARS-CoV have been identified.° While it is unclear which of
these ORFs are translated in infected cells, the possibility that
some of them may serve novel functions® raises great interest
in probing their function.

The sequences of most of these putative proteins have
no significant similarity to those of known proteins,? which
complicates the use of sequence analysis tools to probe
their potential function. A statistical learning method, support
vector machines (SVM), has recently been applied to protein
functional classification,’~® fold recognition,!° analysis of solvent
accessibility,!' prediction of secondary structures,!* and
protein-protein interactions.'*!4 As a method that uses
sequence-derived physicochemical properties of proteins as the
basis for classification, SVM has shown some potential for
predicting the functional class of distantly related proteins and
homologous proteins of different functions.*°° It may thus be
a useful method to complement sequence alignment, clustering,
and motif-based methods in the functional characterization
of novel proteins. A web-based SVM protein functional
classification software, SVMProt
(http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi),® is used in this
work to ascribe possible functional roles to SARS-CoV putative
proteins.


Method 


SVMProt’ is a group of integrated classification systems that
use a statistical learning method, support vector machines
(SVM),!°!’ to predict the functional class of a protein from
its primary sequence, irrespective of sequence similarity. It
currently covers 97 protein functional classes, including 46
enzyme families, 21 channel/transporter families, 5 RNA-binding
protein families, DNA-binding proteins, G-protein-coupled
receptors, nuclear receptors, tyrosine receptor kinases,
cell adhesion proteins, coat proteins, envelope proteins,
transmembrane proteins, outer membrane proteins, structural
proteins, growth factors, and antigens. To date, the majority
of known types of viral proteins are included in these
classes.

Representative proteins of a particular functional class
(positive samples) and those that do not belong to this class
(negative samples) are needed to train an SVMProt classifier for
this class. The positive samples of a class are constructed by
using all of the known distinct protein members in that class.
Because of the enormous number of proteins, the number of
negative samples needs to be restricted to a manageable level
by using a minimum set of representative proteins. One way
of choosing representative proteins is to select one or a few
proteins from each protein domain family. The negative
samples of a class are selected from seed proteins of the 7316
curated protein families (domain-based) in the Pfam database,
excluding those families that have at least one member
belonging to the functional class. Pfam families are constructed
on the basis of sequence similarity. The purpose of using Pfam
proteins is to ensure that the negative samples are evenly
distributed in the protein space. Sequence similarity is not
required for selecting positive samples. In this sense, SVMProt
is to some extent independent of sequence similarity.


The SVMProt training system for each class is optimized and
tested by using separate testing sets of both positive and
negative samples. Where possible, all of the remaining distinct
proteins in each functional family (not in the training set of
that family) are used as positive samples, and all of the
remaining representative seed proteins in Pfam curated families
are used to construct negative samples in a testing set. The
performance of SVMProt classification is further evaluated by
using independent sets of both positive and negative samples.
No protein is duplicated within any training, testing, or
independent evaluation set.


Data set construction can be illustrated with the example
of viral coat proteins. The keyword “virus coat protein”
is used to search the Swiss-Prot database, which finds 3012
entries. These entries are checked to remove noncoat proteins,
redundant entries, and putative proteins, which gives 848
positive samples. These positive samples cover 140 Pfam
families; thus, 14 758 seed proteins of the remaining 7176 Pfam
families are used as the negative samples. These positive and
negative samples are further divided into training sets (346
positive and 1474 negative), testing sets (305 and 8370), and
independent evaluation sets (197 and 4914), using the
procedure described above.
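
The counts quoted in the coat protein example can be checked with a few lines of arithmetic. The following is a minimal sketch; the variable and function names are illustrative only and are not part of SVMProt.

```python
# Counts quoted in the viral coat protein example.
PFAM_FAMILIES_TOTAL = 7316
PFAM_FAMILIES_COVERED_BY_POSITIVES = 140

positives = {"training": 346, "testing": 305, "independent": 197}
negatives = {"training": 1474, "testing": 8370, "independent": 4914}


def check_split(parts, expected_total):
    """Return True if the three subsets add up to the stated total."""
    return sum(parts.values()) == expected_total


# Families left over for negative sampling once the positive families are excluded.
remaining_families = PFAM_FAMILIES_TOTAL - PFAM_FAMILIES_COVERED_BY_POSITIVES

print(remaining_families)
print(check_split(positives, 848))
print(check_split(negatives, 14758))
```

The three positive subsets add up to the 848 positive samples, the three negative subsets add up to the 14 758 seed proteins, and removing the 140 covered families from 7316 leaves the 7176 families stated in the text.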


Not all of the SVMProt classes are at the same hierarchical 
level. These classes are mixtures of subfamilies, families, and 
superfamilies. Some classes, such as antigen, need to be more 
clearly defined into specific subclasses. While it is desirable to 
define all of the classes at the same level, this is not yet possible 
because of insufficient data for the subhierarchies of some 
families and superfamilies. Effort is being made to collect 
sufficient data so that SVMProt classification systems can be 
constructed on the basis of more evenly distributed family 
structures. 


Nonetheless, prediction on the basis of the current structures
provides a useful hint about the functional class of a protein.
SVMProt is trained for protein classification in the following
manner. First, every protein sequence is represented by a
specific feature vector assembled from encoded representations
of tabulated residue properties, including amino acid composition,
hydrophobicity, normalized van der Waals volume, polarity,
polarizability, charge, surface tension, secondary structure,
and solvent accessibility, for each residue in the sequence.* The
feature vectors of the positive and negative samples are then used
to train an SVMProt classifier. The trained SVMProt classifier can
then be used to classify a protein into either the positive group
(the protein is predicted to be a member of the class) or the
negative group (the protein is predicted not to belong to the
class).
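
As a rough illustration of how a sequence can be turned into a fixed-length feature vector, the sketch below computes the amino acid composition plus the composition of three residue groups. It is a simplified assumption, not the SVMProt encoding: the three-way hydrophobicity grouping and all function names are hypothetical, and only two of the listed properties are represented.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical three-way grouping of residues by hydrophobicity,
# used here for illustration only.
HYDROPHOBICITY_GROUPS = {
    "polar": set("RKEDQN"),
    "neutral": set("GASTPHY"),
    "hydrophobic": set("CLVIMFW"),
}


def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard residues in the sequence."""
    seq = sequence.upper()
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def group_composition(sequence, groups):
    """Fraction of residues that fall into each property group."""
    seq = sequence.upper()
    return [
        sum(1 for residue in seq if residue in members) / len(seq)
        for members in groups.values()
    ]


def feature_vector(sequence):
    """Concatenate the composition descriptors into one vector."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return amino_acid_composition(sequence) + group_composition(
        sequence, HYDROPHOBICITY_GROUPS
    )


vector = feature_vector("MKVLAA")
print(len(vector))
```

Every protein, whatever its length, maps to a vector of the same dimension, which is what allows positive and negative samples to be compared without aligning their sequences.
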

Support vector machine (SVM) is a promising algorithm for
binary classification by means of supervised learning, which
was originally developed by Vapnik and co-workers. The theory
of SVM has been described in the literature.!® Thus, only a brief
description is given here. SVM is based on the structural risk
minimization (SRM) principle from statistical learning theory.'®
In linearly separable cases, SVM constructs a hyperplane that
separates two different groups of feature vectors with a
maximum margin. A feature vector is represented by $\mathbf{x}_i$, with
physicochemical descriptors of a protein as its components.
The hyperplane is constructed by finding another vector $\mathbf{w}$ and
a parameter $b$ that minimize $\|\mathbf{w}\|^2$ and satisfy the following
conditions

$$\mathbf{w}\cdot\mathbf{x}_i + b \ge +1, \quad \text{for } y_i = +1 \quad \text{Group 1 (positive)} \tag{1}$$

$$\mathbf{w}\cdot\mathbf{x}_i + b \le -1, \quad \text{for } y_i = -1 \quad \text{Group 2 (negative)} \tag{2}$$

where $y_i$ is the group index, $\mathbf{w}$ is a vector normal to the
hyperplane, $|b|/\|\mathbf{w}\|$ is the perpendicular distance from the
hyperplane to the origin, and $\|\mathbf{w}\|^2$ is the squared Euclidean norm of $\mathbf{w}$.
After the determination of $\mathbf{w}$ and $b$, a given vector $\mathbf{x}$ can be
classified by

$$\operatorname{sign}[(\mathbf{w}\cdot\mathbf{x}) + b] \tag{3}$$

In nonlinearly separable cases, SVM maps the input variable
into a high dimensional feature space using a kernel function
$K(\mathbf{x}_i, \mathbf{x}_j)$. An example of a kernel function is the Gaussian kernel
that has been extensively used in different protein classification
studies:7,10-13,16,18

$$K(\mathbf{x}_i, \mathbf{x}_j) = e^{-\|\mathbf{x}_j - \mathbf{x}_i\|^2 / 2\sigma^2} \tag{4}$$

A linear support vector machine is applied to this feature space,
and the decision function is then given by

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_{i=1}^{l} \alpha_i^0 y_i K(\mathbf{x}, \mathbf{x}_i) + b\right) \tag{5}$$

where the coefficients $\alpha_i^0$ and $b$ are determined by maximizing
the following Lagrangian expression

$$\sum_{i=1}^{l} \alpha_i - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \tag{6}$$

under the following conditions:

$$\alpha_i \ge 0 \quad \text{and} \quad \sum_{i=1}^{l} \alpha_i y_i = 0 \tag{7}$$

A positive or negative value from eq 3 or eq 5 indicates that
the vector $\mathbf{x}$ belongs to the positive or negative group,
respectively.
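
The kernel decision rule of eq 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the support vectors, coefficients, bias, and kernel width below are hypothetical toy values rather than trained SVMProt parameters, and treating a decision value of exactly zero as positive is an arbitrary choice made here.

```python
import math


def gaussian_kernel(x, z, sigma):
    """Gaussian kernel: exp(-||x - z||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decision_value(x, support_vectors, labels, alphas, b, sigma):
    """Weighted kernel sum over the support vectors, plus the bias b."""
    total = sum(
        alpha * y * gaussian_kernel(x, sv, sigma)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    )
    return total + b


def classify(x, support_vectors, labels, alphas, b, sigma):
    """Return +1 (positive group) or -1 (negative group)."""
    value = decision_value(x, support_vectors, labels, alphas, b, sigma)
    return 1 if value >= 0 else -1  # tie at zero assigned to +1 by assumption


# Hypothetical toy model: one support vector per group.
SUPPORT_VECTORS = [[0.0, 0.0], [2.0, 2.0]]
LABELS = [+1, -1]
ALPHAS = [1.0, 1.0]
BIAS = 0.0
SIGMA = 1.0

print(classify([0.1, 0.1], SUPPORT_VECTORS, LABELS, ALPHAS, BIAS, SIGMA))
print(classify([1.9, 1.9], SUPPORT_VECTORS, LABELS, ALPHAS, BIAS, SIGMA))
```

A query vector near the positive support vector receives a positive decision value and one near the negative support vector a negative value, which mirrors the sign rule stated for eq 3 and eq 5.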

Scoring of the SVM classification of proteins has been
estimated by a reliability index, and its usefulness has been
demonstrated by statistical analysis.*!* A slightly modified
reliability score, the R-value, is used in SVMProt

$$R\text{-value} = \begin{cases} 1 & \text{if } d < 0.2 \\ d/0.2 + 1 & \text{if } 0.2 \le d < 1.8 \\ 10 & \text{if } d \ge 1.8 \end{cases} \tag{8}$$

where d is the distance between the position of the vector of a
classified protein and the optimal separating hyperplane in the

Table 1. Novel Viral Proteins, Their SVMProt Predicted Functional Classes, and Functions Suggested from Experiment and/or Sequence Analysis^a

Columns: Protein (SwissProt and NCBI ID); Virus; Function suggested by experiment and/or sequence analysis (reference); Functional classes characterized by SVMProt (probability of correct characterization); Prediction status
CRV3 Q80MM6 Cotesia rubecula virus A novel protein with homology Lectin (99.0%) + 
(AY234855) to C-type lectin?! EC 3.1.-.-:. Hydrolase — Acting 
on Ester Bonds (73.8%) 
SPLT137 SpLtMNPV virus A novel envelope protein*® TC 3.A.5: Type II (general) =
(NP_258405) secretory pathway family (58.6%) 
EIA 13S protein Human adenovirus Formed transcription complex*° DNA-binding Protein (99.2%) c 
(Q8JSK1) type 21 Cell adhesion (71.3%) 
(AF492353) Outer membrane (58.6%) 
P14 (Q38563) Bacteriophage phi-6 A new small low-abundant EC3.4.-.-: Peptidase (58.6%) c 
(AAC60530) nonstructural protein, facilitated 
packaging or host-cell 
membrane repair®* 
V cath AcMNPV Cathepsin-like protease EC 3.4.-.-:. Hydrolases— te 
(P25783) (EC3.4.22.50)22 Peptidase (99.0%) 
EC 4.1.-.-: Carbon—Carbon 
Lyases (68.5%) 
EC 1.2.-.-: Oxidoreductases— 
Acting on the aldehyde or oxo 
group of donors (68.5%) 
EC 2.1.-.-: Transferase of One- 
Carbon Groups (58.6%) 
TC 3.A.5: Type II (general) 
secretory pathway family (58.6%) 
MotA protein bacteriophage T4 DNA-binding, transcription DNA-binding Proteins (99.0%) te 
(P22915) regulation? EC 3.1.-.-:. Hydrolase— 
Acting on Ester Bonds (68.5%) 
TC 3.A.5: Type II (general) 
secretory pathway family (58.6%) 
TC 3.A.1: ATP-binding cassette 
family (58.6%) 
M3 protein Murine gamma soluble chemokine receptor*+ Transmembrane (99.0%) PC 
(041925) herpesvirus 68 Cell adhesion (82.2%) 
EC 3.4.-.-: Peptidase (62.2%) 
TC 3.A.3: P-type ATPase 
family (58.6%) 
7 transmembrane receptor 
(Secretin family) (58.6%) 
TC 1.C.: Channels/Pores— 
Pore-forming toxins (58.6%) 
BFRF1 protein Epstein-barr virus localized on the EC 2.7.-.-: Transferases of - 
(P03185) (strain B95—8) plasma membrane and Phosphorus-Containing 
nuclear compartments Groups (88.1%) 
of the cells and is EC 4.1.-.-: Carbon—Carbon 
a structural component Lyase (58.6%) 
of the viral particle*® 
VSVG Vesicular stomatitis Transmembrane glycoprotein”® Transmembrane (99.0%) + 
(Q89570) virus Aptamer-binding protein 
(94.7%) 
EC 3.4.-.-: Peptidase (92.1%) 
Coat protein (88.1%) 
EC 2.7.-.-: Transferases of 
Phosphorus-Containing 
Groups (83.9%) 
EC 1.18.-.-: Oxidoreductases— 
Acting on iron-sulfur proteins 
as donors (73.8%) 
VCP (P68639) Vaccinia virus A novel complement No function predicted - 
control protein, 
binds to C3b and C4b*° 
Major structural Human Self-assembles into Coat protein (73.8%) + 
protein L1 (Q8V1L7) papillomavirus viral particles and 


(AF459425) 


bind to a cell-surface 
receptor, coat protein, 
capsid formation”® 
FALPE Amsacta moorei Associated with unique EC 2.7.-.-: Transferases of = 
(Q65010) Entomopoxvirus cytoplasmic Phosphorus-Containing 
structures, filament- Groups (58.6%) 
associated protein*! 
EUS2 (ORF69) Equine herpesvirus-1 Serine-threonine kinase EC 2.7.-.-: Transferase of + 
(P28926) (EC 2.7.1.37)2? Phosphorus-Containing 
Groups (99.0%) 
Virulence factor Human herpes Complexes with PCNA DNA-binding Proteins PC 


ICP34.5 (P36313) 


Putative BARFO 
protein (Q8AZJ4) 


35k myristylprotein 


(093122) 


ICP6 (039263) 


Protein IRS1 
(P09715) 


simplex virus 1 


Epstein—Barr virus 


Shope fibroma 
poxvirus 


HSV-1 


Human 
cytomegalovirus 


hich forms part 

of replication machinery?! 
Membrane associated 

and encodes three 

arginin-rich motifs of 

RNA-binding properties’ 


a soluble secreted 
form of an acquired 
cellular receptor for 
tumor necrosis factor’ 
a novel protein kinase 
enzymatic activity 
(EC 1.17.4.1)78 
Competes for binding 
to DNA recognition 
site of another protein?9 


(62.2%) 

Cell adhesion (58.6%) 

EC 4.1.-.-: Carbon—Carbon 
Lyase (58.6%) 

TC 3.A.15: The Outer 


Membrane Protein Secreting 


Main Terminal Branch 

family (58.6%) 
Transmembrane (62.2%) 

Outer membrane (58.6%) 


EC 1.17.-.-: Oxidoreductase— 


Acting on CH2 groups (99.1%) 


Outer membrane (58.6%) 
DNA-binding Protein (83.9%) 
EC 3.1.-.-:. Hydrolase— 


Acting on Ester Bonds (76.2%) 


Outer membrane (58.6%) 


^a The symbols +, C, PC, and - represent the cases in which one of the SVMProt predicted functional classes is in agreement with, consistent with, partially consistent with, or does not match the function suggested from experiment and/or sequence analysis, respectively.


hyperspace. There is a statistical correlation between the 
R-value and the expected classification accuracy (probability 
of correct classification).*!* Thus, another quantity, the P-value, 
is introduced to indicate the expected classification accuracy. 
The P-value is derived from the statistical relationship between 
the R-value and the actual classification accuracy based on the 
analysis of 9932 positive and 45999 negative samples of 
proteins.® 
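
The piecewise definition of the R-value in eq 8 translates directly into code. The sketch below covers only the R-value; the mapping from R-value to P-value is derived empirically in the cited work and is not reproduced here. The function name is illustrative.

```python
def r_value(d):
    """Reliability score of eq 8 for a distance d from the separating hyperplane."""
    if d < 0.2:
        return 1.0
    if d < 1.8:
        return d / 0.2 + 1.0
    return 10.0


for distance in (0.1, 1.0, 2.5):
    print(distance, r_value(distance))
```

The middle branch approaches 10 as d approaches 1.8, so the score rises continuously into its upper cap, whereas it steps from 1 to 2 at d = 0.2.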


Results and Discussion 


Test of the Capability of SVMProt for Predicting Functional
Class of Novel Viral Proteins. Eighteen novel viral proteins with
available sequence and function information, retrieved from
Medline!’ abstracts published from 1987 to 2003, are
used to test SVMProt. These proteins are described as novel
or new in their respective abstracts. BLAST*° analysis shows
that 10 of these have no significant sequence similarity to
known proteins, and three others have homology to no
more than five distinct proteins (sequence similarity score
e-value < 0.05). The remaining five proteins possess homology
to a relatively small number of proteins, primarily from a few
viruses in several different species. Table 1 gives the SVMProt
ascribed functional class for each protein together with the
function described in the literature. More than one functional
class may be characterized by SVMProt, and the probability of
correct prediction for each functional class can be estimated by
using a statistical method,® which is also given in Supporting
Information Table 1.

For nine proteins, one of the SVMProt characterized
functional classes matches that described in the literature.
These are CRV3 of Cotesia rubecula virus,?! V-cath of AcMNPV,22
MotA protein of bacteriophage T4,*3 M3 protein of Murine
gamma herpesvirus 68,*4 VSVG of Vesicular stomatitis virus,”
major structural protein L1 of Human papillomavirus,® EUS2
(ORF69) of Equine herpesvirus-1,?’ ICP6 of HSV-1,78 and Protein
IRS1 of Human cytomegalovirus.”® Two other proteins, E1A 13S
protein of Human adenovirus 21°° and virulence factor ICP34.5
of Herpes simplex virus,*! are characterized as DNA-binding,
which is consistent with the findings that the first protein is part
of a transcription complex and the second is part of a
replication complex that binds to the viral DNA. Moreover, P14
of bacteriophage phi-6 is predicted to be a protein in the EC3.4
family. A likely candidate for this protein is a metalloproteinase,
whose function would be consistent with the reported role of P14 in
facilitating viral packaging or host-cell membrane repair.**
Overall, for 67% of the novel viral proteins (12 of 18), one of the
SVMProt characterized functional classes is consistent with that
described in the literature.
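
The 67% figure follows from the counts given in this paragraph: nine matching proteins, two DNA-binding proteins judged consistent, and P14. A minimal sketch of the tally follows; the variable and function names are illustrative only.

```python
def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to the nearest integer."""
    return round(100 * part / whole)


matching = 9               # one predicted class matches the literature
consistent_dna_binding = 2  # E1A 13S protein and ICP34.5
consistent_peptidase = 1    # P14 of bacteriophage phi-6
total_novel_proteins = 18

consistent_total = matching + consistent_dna_binding + consistent_peptidase

print(consistent_total, percent(consistent_total, total_novel_proteins))
print(percent(11, 15))
```

The same helper applied to the 11 of 15 SARS-CoV proteins with known function reproduces the 73% quoted in the abstract.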


SVMProt Prediction of the Functional Class of SARS-CoV 
Proteins. The sequence of each individual protein or putative 
protein contained in the NCBI entry NC_004718 of the com- 
plete SARS-CoV genome?’ is used for predicting its functional 
class. The SVMProt characterized functional classes of SARS- 
CoV proteins are given in Table 2 together with the estimated 
probability of correct prediction and the suggested function 
from experiment or sequence alignment for some of these 
proteins.® There are fifteen proteins with function derived from 
experiment or sequence analysis.° These are 3C-like proteinase, 
NSP3, NSP4, NSP6, NSP9, NSP10, NSP13, NSP14, NSP15, RNA- 
dependent RNA polymerase, putative ribose 2’-O-methyltrans- 

Table 2. SVMProt Predicted Functional Class of SARS-Associated Coronavirus Proteins Together with the Function Suggested by Experiment or Sequence Analysis^a

Columns: Protein; Function suggested by experiment or sequence analysis from ref 5 and description in NCBI entry NC_004718; Functional classes characterized by SVMProt (probability of correct characterization); Prediction status
Counterpart of MHV Unknown EC 2.6.-.-: Transferases of ?
p65 protein Nitrogenous Groups (86.8%) 
EC 1.1.-.-: Oxidoreductases— 
Acting on the CH—OH group of 
donors (62.2%) 
3CL-PRO 3C-like proteinase (EC 3.4.22.-) EC 3.4.-.-: Peptidase (62.2%) + 
nsp3 predicted phosphoesterase Transmembrane (98.5%) PC 
(EC3.1.-.-) (similar to the EC 3.6.-.-:. Hydrolases (83.9%) 
ppr-1’—p processing enzyme) Outer membrane (58.6%) 
formerly known as ‘X-domain’, TC 3.A.3: P-type ATPase family 
PL-PRO (EC 3.4.22.-) similar (58.6%) 
to that of MHV PL2-PRO, 
Y-domain; Transmembrane 
domain 1; adenosine diphosphate- 
ribose 1”-phosphatase (ADPR) 
(EC 3.6.1.-) 
nsp4 Contains transmembrane G Protein Coupled Receptors +
domain 2 (98.9%)
Transmembrane (98.8%)
7 transmembrane receptor
(rhodopsin family and
chemoreceptor) (62.2%)
7 transmembrane receptor
(metabotropic glutamate
family) (58.6%)
7 transmembrane receptor
(Secretin family) (58.6%)
TC 2.A.1: Major facilitator 
family (58.6%) 
TC 3.A.5: Type II (general) 
secretory pathway family (58.6%) 
nsp6 putative transmembrane Transmembrane (99.1%) + 
domain TC 2.A.1: Major facilitator 
family (58.6%) 
nsp7 Unknown TC 3.A.15: The Outer Membrane ?
Protein Secreting Main Terminal 
Branch family (58.6%) 
TC 3.A.1: ATP-binding cassette 
family (58.6%) 
nsp8 Unknown EC 2.3.-.-: Acyltransferases ?
(58.6%) 
EC 4.2.-.-:_ Carbon—Oxygen 
Lyases (58.6%) 
nsp9 ssRNA-binding protein EC 2.4.-.-: Glycosyltransferases =
(experimental)°63” (76.2%) 
DNA-binding Proteins (58.6%) 
nsp10 formerly known as growth- Outer membrane(58.6%) = 
factor-like protein 
nsp12 (RNA-dependent RNA-dependent RNA Transmembrane (83.9%) + 
RNA polymerase) polymerase (EC 2.7.7.48) EC 4.1.-.-:_ Carbon—Carbon 
Lyases (76.2%) 
EC 2.7.-.-: Transferases of 
Phosphorus-Containing Groups 
(62.2%) 
nsp13 zinc-binding domain (ZD), Transmembrane (97.3%) + 
NTPase/helicase** domain Envelope protein (96.7%) 
(EC3.6.1.-). RNA 5’-triphosphatase EC 3.6.-.-:. Hydrolases — Acting 
(EC 3.6.1.-) on Acid Anhydrides (62.2%) 
nsp14 3’-to-5’ exonuclease (EC 3.1.-.-) EC 3.4.-.-: Peptidases (76.2%) = 
nsp15 uridylate-specific EC 1.1.-.-: Oxidoreductases— = 


endoribonuclease 
NendoU (EC3.1.-.-) 


Acting on the CH—OH group of 

donors (97.0%) 

EC 2.7.-.-: Transferases of 
Phosphorus-Containing Groups 
(78.4%) 

EC 4.2.-.-:_ Carbon—Oxygen 
Lyases (65.4%) 
putative ribose 2’-O-ribose EC 2.6.-.-: Transferases of +p 
2’-O-methyltransferase methyltransferase Nitrogenous Groups (92.1%) 
(EC 2.1.1.-) EC 3.1.-.-:. Hydrolases—Acting 
on Ester Bonds (73.8%) 
EC 4.1.-.-: Carbon—Carbon 
Lyase (62.2%) 
EC 2.4.-.-: Glycosyltransferases 
(58.6%) 
EC 2.1.-.-: Transferase of One- 
Carbon Groups (58.6%) 
S protein spike glycoprotein*® Transmembrane (99.0%) of 
Aptamer-binding protein (93.6%) 
Envelope protein (92.9%) 
Orf3 (sars3a) No significant similarity Transmembrane (97.5%) + 
to known proteins, three TC 1.A.1: Voltage-gated ion 
transmembrane regions, channel family (58.6%) 
signal peptide, ATP-binding 
properties 
Orf4 (sars3b) No significant similarity to Transmembrane (58.6%) + 
known proteins, a single TC 3.A.1: ATP-binding cassette 
transmembrane helix family (58.6%) 
TC 1.C.: Channels/Pores—Pore- 
forming toxins (58.6%) 
E protein Small envelope protein Transmembrane (99.0%) =e 
EC 1.9.-.-: Oxidoreductases— 
Acting on a heme group of 
donors (62.2%) 
EC 3.6.-.-:. Hydrolases—Acting 
on Acid Anhydrides (58.6%) 
Envelope protein (58.6%) 
M protein membrane glycoprotein; Transmembrane (98.8%) + 
Matrix Protein Structural protein (Matrix protein, 
Core protein, Viral occlusion 
body,Keratin) (89.3%) 
Orf7 (sars6) No significant similarity to No function predicted a 
known proteins, a likely 
transmembrane helix between 
residue 3 and 22. 
Orf8 (sars7a) No significant similarity to known Transmembrane (76.2%) + 
proteins, a cleaved signal 
sequence, a transmembrane helix 
Orf9 (sars7b) weakly similar to sterol-C5 Transmembrane (58.6%) + 
desaturase (EC1.3.-.-), a single 
strong transmembrane helix 
Orf10 (sars8a) no significant sequence similarity Transmembrane (58.6%) + 
to known proteins, a 
transmembrane helix with one 
end within viral particle 
Orf11 (sars8b) Matches to a region of human Outer membrane (58.6%) ?
coronavirus E2 glycoprotein RNA-binding Protein (58.6%) 
precursor, a soluble protein 
N protein (sars9a) nucleocapsid protein*® RNA-binding Protein (58.6%) C
Orf13 (sars9b) Unknown, no transmembrane Structural protein (Matrix protein,
helix Core protein, Viral occlusion
body, Keratin) (71.3%)
Outer membrane (58.6%)
EC 4.1.-.-: Carbon-Carbon
Lyase (58.6%)
EC 2.8.-.-: Transferases of Sulfur-
Containing Groups (58.6%)
EC 4.2.-.-: Carbon-Oxygen Lyase
(58.6%)

^a The symbols +, C, PC, and - are described in Table 1. The symbol “?” indicates that the currently available information is insufficient to determine prediction status.



ferase, and the S, E, M, and N proteins. The sequences of all of 
the above proteins are given in NCBI entry NC_004718. Eleven 
out of these fifteen proteins with known sequence information 
have one of the SVMProt ascribed functions, consistent with 
that predicted from experiment or sequence analysis. 

The replicase polyprotein is known to contain three other
potential protein-encoding subunits. One of these is the
counterpart of the MHV p65 protein. SVMProt ascribes it to
either the EC2.6 transferases of nitrogenous groups (87%) or the
EC1.1 oxidoreductases (62%). A viral protein with EC1.1
oxidoreductase activity has been found,** whereas there is no
report of a viral protein that belongs to the EC2.6 family.
Therefore, this protein seems more likely to be an enzyme with
EC1.1 oxidoreductase activity. The other two are the
nonstructural proteins NSP7 and NSP8. The function of NSP7 and
NSP8 has not been determined, and no functional motif or domain
has been reported for either protein. SVMProt characterizes NSP7
as a member of the transporter TC3.A.15 family or TC3.A.1 family.
There is evidence of a viral protein in the ATP-binding cassette
(TC3.A.1) family,*4 whereas there is no report of a viral outer
membrane protein secreting main terminal branch. Hence,
NSP7 is likely a transporter of the ATP-binding cassette family.
NSP8 is classified as either an EC2.3 acyltransferase (59%) or
an EC4.2 carbon-oxygen lyase (59%). A viral protein with EC4.2
lyase activity has been found,** whereas there is no report of
one with acyltransferase activity. Thus, it is likely that NSP8 is
an enzyme with EC4.2 lyase activity.
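
The reasoning used in this paragraph is an elimination step: among the classes SVMProt proposes, those with no reported viral member are set aside. A minimal sketch of that filter follows. The function name and the data layout are assumptions, and the sets of classes with viral precedent contain only what the paragraph itself states.

```python
def plausible_classes(predicted, classes_with_viral_precedent):
    """Keep predicted (class, probability) pairs whose class has a known viral member."""
    return [
        (name, probability)
        for name, probability in predicted
        if name in classes_with_viral_precedent
    ]


# Predictions and precedents as described for NSP8 and the MHV p65 counterpart.
nsp8_predicted = [
    ("EC 2.3 acyltransferase", 0.586),
    ("EC 4.2 carbon-oxygen lyase", 0.586),
]
p65_predicted = [
    ("EC 2.6 transferase of nitrogenous groups", 0.868),
    ("EC 1.1 oxidoreductase", 0.622),
]

print(plausible_classes(nsp8_predicted, {"EC 4.2 carbon-oxygen lyase"}))
print(plausible_classes(p65_predicted, {"EC 1.1 oxidoreductase"}))
```

Note that for the p65 counterpart the filter retains the class with the lower predicted probability, so the outcome depends on how complete the record of known viral proteins is.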

The putative protein ORF3 is characterized by SVMProt as a
transmembrane protein (97.5%) and a transporter in the TC1.A.1
family (voltage-gated ion channel). The predicted transmembrane
property is consistent with the reported identification
of three transmembrane regions within this protein.° Since
proteins of the TC1.A.1 family have been found in a wide range
of viruses as well as bacteria, archaea, and eukaryotes
(http://www.tcdb.org), it is possible that ORF3 is also a voltage-gated
ion channel. SVMProt ascribes three functional classes to ORF4:
transmembrane (59%), TC3.A.1 ATP-binding cassette
(59%), and TC1.C channels/pores, pore-forming toxins (59%).
The characterized transmembrane property is consistent with
the reported identification of a single transmembrane helix
in this protein.® No viral protein of the TC1.C (channels/pores,
pore-forming toxins) family has been found. Thus, it is possible
that ORF4 is a protein of the TC3.A.1 ATP-binding cassette
family. ORF4 overlaps with ORF3 and the E protein, but no
potential TRS sequence can be found at the 5’ end of this
protein, which led to the suggestion that it might be expressed
from the ORF3 mRNA using an internal ribosomal entry site.°

SVMProt fails to ascribe a function for ORF7. This putative 
protein has no significant sequence similarity with known 
proteins. An analysis of this putative protein indicated a likely 
transmembrane helix between residues 3 and 22 with the N 
terminus located outside the viral particle.©5 ORF8, ORF9, and 
ORF10 are characterized as transmembrane proteins by SVM- 
Prot. The predicted transmembrane property for these putative 
proteins is consistent with reports from a study of SARS-CoV 
genome, which shows that each of these putative proteins 
contains a transmembrane helix.* FASTA analysis suggested 
weak similarities of ORF9 with a sterol-C5 desaturase and a 
hypothetical Clostridium perfringens protein.® But SVMProt is 
unable to provide additional information about the function 
for this and the other two putative proteins. 

ORF11 is predicted as either an outer membrane protein
(59%) or an RNA-binding protein (59%). A section of this
putative protein is known to match a region of the human
coronavirus E2 glycoprotein precursor, but it is unclear which
function is more likely for this protein. ORF13 is characterized
as a structural protein (71%), an EC4.1 carbon-carbon
lyase (59%), an EC2.8 transferase of sulfur-containing groups
(59%), an EC4.2 carbon-oxygen lyase (59%), or an outer
membrane protein (59%). There has been no report about the
possible function of this putative protein other than the finding
that no transmembrane helix is detected within this protein.®
Structural proteins include matrix proteins, core proteins, and
viral occlusion body proteins found in various viruses. No viral
protein of the EC4.1 or EC2.8 family has been found. It is therefore
likely that ORF13 is a structural protein, an outer
membrane protein, or an EC4.2 carbon-oxygen lyase, but it
remains to be determined which of these is the most likely
function for this protein.

Can Combination of BLAST and SVMProt Improve the
Prediction Accuracy of SVMProt? It is of interest to examine
whether SVMProt prediction of the SARS-CoV proteins can be
further improved if it is combined with sequence alignment. A
BLAST search is conducted to find similar proteins in other
coronaviruses for each of the 4 SARS-CoV proteins whose
functional class is incorrectly predicted by SVMProt. SVMProt
is then used to predict the functional class of the similar
proteins found for each of these proteins. It is found that
the functional class of two of these 4 SARS-CoV proteins, NSP9
and NSP14, can be correctly predicted by such an approach.
The P12 chain in pp1ab of HCoV-229E, which is similar in sequence
to NSP9 according to the BLAST search, is predicted to be an
RNA-binding protein by SVMProt, which is consistent with the
experimental suggestion that NSP9 is a single-stranded
RNA-binding protein.***’ A fragment in Replicase 1b of HCoV, which
is similar in sequence to NSP14 according to the BLAST search,
is predicted to be an EC 3.1 hydrolase acting on ester bonds,
which is consistent with the annotation that NSP14 is a 3’-to-5’
exonuclease (EC 3.1.-.-). Therefore, our study suggests that
the combination of BLAST and SVMProt can to some extent
improve the prediction accuracy of SVMProt. SVMProt prediction
of the functional classes of the encoded proteins in all of
the available coronavirus genomes is given in the Supporting
Information.
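
The two-step procedure described above (find similar proteins, then classify them) can be sketched as a small function that takes the two steps as parameters. Everything here is a hypothetical stand-in: `predict_classes` and `find_homologs` are placeholder callables, not the SVMProt or BLAST interfaces, and the toy lookup tables use protein names in place of sequences, with entries taken from the NSP9 example in the text.

```python
from typing import Callable, Dict, Iterable, Set, Tuple


def combined_prediction(
    query: str,
    predict_classes: Callable[[str], Set[str]],
    find_homologs: Callable[[str], Iterable[str]],
) -> Tuple[Set[str], Dict[str, Set[str]]]:
    """Return the direct prediction and the predictions for each similar protein."""
    direct = set(predict_classes(query))
    via_homologs = {
        homolog: set(predict_classes(homolog)) for homolog in find_homologs(query)
    }
    return direct, via_homologs


# Toy lookup tables standing in for the classifier and the similarity search.
TOY_PREDICTIONS = {
    "NSP9": {"EC 2.4 glycosyltransferase", "DNA-binding protein"},
    "HCoV-229E P12": {"RNA-binding protein"},
}
TOY_HOMOLOGS = {"NSP9": ["HCoV-229E P12"]}

direct, via_homologs = combined_prediction(
    "NSP9",
    lambda name: TOY_PREDICTIONS.get(name, set()),
    lambda name: TOY_HOMOLOGS.get(name, []),
)
print(direct)
print(via_homologs)
```

Keeping the direct and homolog-derived predictions separate preserves the provenance of each suggested class, which matters because a class inferred through a similar protein rests on the additional assumption that function is conserved between the two.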


Conclusion 


SVMProt shows a certain level of capability for predicting 
the functional class of a number of novel viral proteins, the 
majority of which are distantly related proteins. It is also able 
to predict the functional class of 73% of the SARS-CoV proteins 
with known function. Our analysis provides the functional 
classes of some of the putative proteins in SARS-CoV, which is 
subject to further validation. The functional class of two 
additional SARS-CoV proteins can be predicted by combining 
the BLAST sequence comparison method with SVMProt. Our 
study suggests that combined use of different methods can 
facilitate the functional study of SARS-CoV proteins and other 
novel proteins, which assists in the mechanistic study of SARS 
and other diseases and the development of therapeutics for 
their treatment. 


Acknowledgment. This work was supported in part 
by grants from Singapore ARF R-151-000-031-112, the Shanghai 
Commission for Science and Technology (04DZ19850), and 
the “973” National Key Basic Research Program of China 
(2004CB715901). 


Journal of Proteome Research e Vol. 4, No. 5, 2005 1861 


Functional Class of the SARS Coronavirus Proteins 


Supporting Information Available: SVMProt pre- 
dicted functional class of the encoded proteins in five coro- 